Portuguese Bank Marketing Strategy - TPOT Tutorial

The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
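
If you do not have Data_FinalProject.csv locally, the same data is available from the UCI repository linked above as bank-additional-full.csv. A minimal loading sketch, assuming that file layout (the UCI copy is semicolon-separated):

import pandas as pd

# The UCI copy of this data set uses ';' as the field separator
Marketing = pd.read_csv('bank-additional-full.csv', sep=';')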


In [1]:
# Import required libraries
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd 
import numpy as np



In [2]:
#Load the data
Marketing = pd.read_csv('Data_FinalProject.csv')
Marketing.head(5)


Out[2]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

5 rows × 21 columns

Data Exploration


In [3]:
Marketing.groupby('loan').y.value_counts()


Out[3]:
loan     y  
no       no     30100
         yes     3850
unknown  no       883
         yes      107
yes      no      5565
         yes      683
Name: y, dtype: int64
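
Raw counts can be hard to compare across groups of different sizes. As a quick sketch (not run in this session), the counts above can be turned into within-group subscription rates:

# Share of 'yes'/'no' within each loan group
Marketing.groupby('loan').y.apply(lambda s: s.value_counts(normalize=True))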

In [4]:
Marketing.groupby(['loan','marital']).y.value_counts()


Out[4]:
loan     marital   y  
no       divorced  no      3420
                   yes      396
         married   no     18469
                   yes     2098
         single    no      8155
                   yes     1345
         unknown   no        56
                   yes       11
unknown  divorced  no       113
                   yes        8
         married   no       528
                   yes       60
         single    no       241
                   yes       39
         unknown   no         1
yes      divorced  no       603
                   yes       72
         married   no      3399
                   yes      374
         single    no      1552
                   yes      236
         unknown   no        11
                   yes        1
Name: y, dtype: int64

Data Munging

The first and most important step in using TPOT on any data set is to rename the target/response variable to 'class'.


In [5]:
Marketing.rename(columns={'y': 'class'}, inplace=True)

At present, TPOT requires all the data to be in numerical format. As we can see below, our data set has 11 categorical variables that contain non-numerical values: job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome, and class.


In [6]:
Marketing.dtypes


Out[6]:
age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
class              object
dtype: object
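
As a cross-check, the categorical columns can also be listed programmatically; a one-liner sketch:

# Columns stored with the generic 'object' dtype are the non-numerical ones here
Marketing.select_dtypes(include=['object']).columns.tolist()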

We then check the number of levels that each of these categorical variables has.


In [7]:
for cat in ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'class']:
    print("Number of levels in category '{0}': {1}".format(cat, Marketing[cat].unique().size))


Number of levels in category 'job': 12
Number of levels in category 'marital': 4
Number of levels in category 'education': 8
Number of levels in category 'default': 3
Number of levels in category 'housing': 3
Number of levels in category 'loan': 3
Number of levels in category 'contact': 2
Number of levels in category 'month': 10
Number of levels in category 'day_of_week': 5
Number of levels in category 'poutcome': 3
Number of levels in category 'class': 2

As we can see, several of these variables (contact, poutcome, class, marital, default, housing, and loan) have only a few levels. Let's find out what they are.


In [8]:
for cat in ['contact', 'poutcome', 'class', 'marital', 'default', 'housing', 'loan']:
    print("Levels for category '{0}': {1}".format(cat, Marketing[cat].unique()))


Levels for category 'contact': ['telephone' 'cellular']
Levels for category 'poutcome': ['nonexistent' 'failure' 'success']
Levels for category 'class': ['no' 'yes']
Levels for category 'marital': ['married' 'single' 'divorced' 'unknown']
Levels for category 'default': ['no' 'unknown' 'yes']
Levels for category 'housing': ['no' 'yes' 'unknown']
Levels for category 'loan': ['no' 'yes' 'unknown']

We then encode these levels manually as numerical values. Any missing values (NaN) are simply replaced with a placeholder value (-999); in fact, we perform this replacement for the entire data set.


In [9]:
Marketing['marital'] = Marketing['marital'].map({'married':0,'single':1,'divorced':2,'unknown':3})
Marketing['default'] = Marketing['default'].map({'no':0,'yes':1,'unknown':2})
Marketing['housing'] = Marketing['housing'].map({'no':0,'yes':1,'unknown':2})
Marketing['loan'] = Marketing['loan'].map({'no':0,'yes':1,'unknown':2})
Marketing['contact'] = Marketing['contact'].map({'telephone':0,'cellular':1})
Marketing['poutcome'] = Marketing['poutcome'].map({'nonexistent':0,'failure':1,'success':2})
Marketing['class'] = Marketing['class'].map({'no':0,'yes':1})
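
The repeated .map() calls above can also be written as a single loop over a dictionary of per-column mappings; an equivalent sketch:

# Level -> code mapping for each manually encoded column
level_maps = {
    'marital':  {'married': 0, 'single': 1, 'divorced': 2, 'unknown': 3},
    'default':  {'no': 0, 'yes': 1, 'unknown': 2},
    'housing':  {'no': 0, 'yes': 1, 'unknown': 2},
    'loan':     {'no': 0, 'yes': 1, 'unknown': 2},
    'contact':  {'telephone': 0, 'cellular': 1},
    'poutcome': {'nonexistent': 0, 'failure': 1, 'success': 2},
    'class':    {'no': 0, 'yes': 1},
}
for col, mapping in level_maps.items():
    Marketing[col] = Marketing[col].map(mapping)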

In [10]:
Marketing = Marketing.fillna(-999)
pd.isnull(Marketing).any()


Out[10]:
age               False
job               False
marital           False
education         False
default           False
housing           False
loan              False
contact           False
month             False
day_of_week       False
duration          False
campaign          False
pdays             False
previous          False
poutcome          False
emp.var.rate      False
cons.price.idx    False
cons.conf.idx     False
euribor3m         False
nr.employed       False
class             False
dtype: bool

For the remaining categorical variables (job, education, month, and day_of_week), we one-hot encode the levels using scikit-learn's MultiLabelBinarizer and treat each resulting binary column as a new feature.


In [11]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

job_Trans = mlb.fit_transform([{str(val)} for val in Marketing['job'].values])
education_Trans = mlb.fit_transform([{str(val)} for val in Marketing['education'].values])
month_Trans = mlb.fit_transform([{str(val)} for val in Marketing['month'].values])
day_of_week_Trans = mlb.fit_transform([{str(val)} for val in Marketing['day_of_week'].values])

In [12]:
day_of_week_Trans


Out[12]:
array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       ..., 
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0]])
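
Since each of these columns holds exactly one label per row, pandas' get_dummies would produce an equivalent one-hot encoding in a single call; a sketch of that alternative:

# One-hot encode the four remaining categorical columns in one shot
dummies = pd.get_dummies(Marketing[['job', 'education', 'month', 'day_of_week']])
dummies.shape  # expect 12 + 8 + 10 + 5 = 35 columns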

Drop the original categorical columns (now encoded separately) and the class label from the feature set.


In [13]:
marketing_new = Marketing.drop(['marital','default','housing','loan','contact','poutcome','class','job','education','month','day_of_week'], axis=1)

In [14]:
assert (len(Marketing['day_of_week'].unique()) == len(mlb.classes_)), "Not Equal"  # sanity check: every level received its own column

In [15]:
Marketing['day_of_week'].unique(),mlb.classes_


Out[15]:
(array(['mon', 'tue', 'wed', 'thu', 'fri'], dtype=object),
 array(['fri', 'mon', 'thu', 'tue', 'wed'], dtype=object))

We then add the one-hot encoded features to form the final dataset to be used with TPOT. Note from the output above that the binarized columns follow the sorted order of mlb.classes_, not the order in which the levels first appear in the data.


In [16]:
marketing_new = np.hstack((marketing_new.values, job_Trans, education_Trans, month_Trans, day_of_week_Trans))

In [17]:
np.isnan(marketing_new).any()


Out[17]:
False

Keeping in mind that the final dataset is now a NumPy array, we can check the number of features in it as follows.


In [18]:
marketing_new[0].size


Out[18]:
45
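
This count checks out: 10 numeric columns remain after dropping 11 of the original 21, and the one-hot encodings contribute 12 (job) + 8 (education) + 10 (month) + 5 (day_of_week) = 35 columns, for 10 + 35 = 45 features in total.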

Finally, we store the class labels, which we need to predict, in a separate variable.


In [19]:
marketing_class = Marketing['class'].values

Data Analysis using TPOT

To begin our analysis, we divide our data into training and validation sets. The validation set just gives us an idea of the test-set error. Model selection and tuning are entirely taken care of by TPOT, so if we want to, we can skip creating this validation set.


In [20]:
training_indices, validation_indices = train_test_split(Marketing.index, stratify=marketing_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size


Out[20]:
(30891, 10297)

After that, we proceed to call the fit(), score(), and export() functions on our training dataset. An important TPOT parameter is the number of generations (the generations kwarg). Since our aim is simply to illustrate the use of TPOT, we keep the default setting of 100 generations and instead bound the total running time via the max_time_mins kwarg (which, once reached, effectively overrides the generation count). We also cap the time allowed for evaluating any single pipeline via max_eval_time_mins.

On a standard laptop with 4 GB of RAM, each generation takes approximately 5 minutes to run. Thus, at the default of 100 generations without the explicit time bound, the total run time could be roughly 8 hours.


In [21]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=2, max_eval_time_mins=0.04, population_size=15)
tpot.fit(marketing_new[training_indices], marketing_class[training_indices])


Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
Optimization Progress: 49pipeline [00:46,  1.10s/pipeline]                  
Generation 1 - Current best internal CV score: 0.913728927925
Optimization Progress: 71pipeline [01:11,  1.06s/pipeline]
Generation 2 - Current best internal CV score: 0.913728927925
Optimization Progress: 95pipeline [01:38,  1.28s/pipeline]
Generation 3 - Current best internal CV score: 0.913728927925
Optimization Progress: 115pipeline [01:58,  1.04s/pipeline]
Generation 4 - Current best internal CV score: 0.913728927925
                                                           
2.00407131667 minutes have elapsed. TPOT will close down.
TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: DecisionTreeClassifier(input_matrix, criterion=gini, max_depth=5, min_samples_leaf=16, min_samples_split=8)
Out[21]:
TPOTClassifier(config_dict={'sklearn.ensemble.GradientBoostingClassifier': {'max_features': array([ 0.05,  0.1 ,  0.15,  0.2 ,  0.25,  0.3 ,  0.35,  0.4 ,  0.45,
        0.5 ,  0.55,  0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,
        0.95,  1.  ]), 'learning_rate': [0.001, 0.01, 0.1, 0.5, 1.0], 'min_samples_... 0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,
        0.95,  1.  ])}, 'sklearn.preprocessing.RobustScaler': {}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        early_stop=None, generations=1000000, max_eval_time_mins=0.04,
        max_time_mins=2, mutation_rate=0.9, n_jobs=1, offspring_size=15,
        periodic_checkpoint_folder=None, population_size=15,
        random_state=None, scoring=None, subsample=1.0, verbosity=2,
        warm_start=False)

In the above, 4 generations were computed before the time limit was reached, each reporting the best internal cross-validation score found so far. The best pipeline achieves an internal CV score of 91.373%, obtained by fitting a decision tree classifier to the data set. Next, the score on the held-out validation set is computed.
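
As an aside, the winning pipeline can also be inspected directly via TPOT's fitted_pipeline_ attribute, which holds the fitted scikit-learn Pipeline; a quick sketch:

# The best pipeline found, as a regular scikit-learn object
print(tpot.fitted_pipeline_)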


In [22]:
tpot.score(marketing_new[validation_indices], Marketing.loc[validation_indices, 'class'].values)


Out[22]:
0.91628629697970287

In [23]:
tpot.export('tpot_marketing_pipeline.py')


Out[23]:
True

In [ ]:
# %load tpot_marketing_pipeline.py
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=42)

# Score on the training set was:0.913728927925
exported_pipeline = DecisionTreeClassifier(criterion="gini", max_depth=5, min_samples_leaf=16, min_samples_split=8)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
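
As a sanity check, one could append an accuracy computation to the exported script; a minimal sketch using scikit-learn's accuracy_score (the variable names are those defined in the exported file above):

from sklearn.metrics import accuracy_score

# Compare the exported pipeline's predictions against the held-out targets
print(accuracy_score(testing_target, results))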
